Updated Megatron version #85

Draft
wants to merge 432 commits into base: nvidia_main

Conversation

jlamypoirier
Collaborator

No description provided.

shjwudp and others added 30 commits January 17, 2024 09:16
Refactor DistributedOptimizer for MoE model support

See merge request ADLR/megatron-lm!986
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>
Signed-off-by: jiemingz <jiemingz@nvidia.com>
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>
add is_first_microbatch for TE

See merge request ADLR/megatron-lm!1033
Need a switch at NeMo level to enable Atomic GEMM

See merge request ADLR/megatron-lm!1017
Add distributed checkpoint support to non-TE based models

See merge request ADLR/megatron-lm!1005
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>
Support for activation offloading to CPU in M-LM

See merge request ADLR/megatron-lm!1016
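
As a rough illustration of what CPU activation offloading involves (a generic PyTorch sketch using `saved_tensors_hooks`, not the actual Megatron-LM implementation), activations saved for backward can be packed to CPU during the forward pass and unpacked back to their original device when backward needs them:

```python
# Generic sketch only; the real Megatron-LM offloading logic may differ.
import torch

def pack_to_cpu(t: torch.Tensor):
    # move the saved activation to CPU, remembering its original device
    return (t.device, t.to("cpu", non_blocking=True))

def unpack_from_cpu(packed):
    device, t = packed
    return t.to(device, non_blocking=True)

layer = torch.nn.Linear(1024, 1024)
x = torch.randn(8, 1024, requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    y = layer(x).sum()   # activations saved for backward now live on CPU
y.backward()             # they are fetched back to the compute device here
```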
add rope and swiglu fusion

See merge request ADLR/megatron-lm!946
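
For context, SwiGLU splits the up-projected activation in half and gates one half with the SiLU of the other; a minimal scripted version (an illustrative sketch, not the fused kernel Megatron-LM actually uses) looks like:

```python
import torch

@torch.jit.script
def fused_swiglu(y: torch.Tensor) -> torch.Tensor:
    # split the projected activation into two halves along the last dim
    half = y.size(-1) // 2
    y1 = y.narrow(-1, 0, half)
    y2 = y.narrow(-1, half, half)
    # SiLU(y1) * y2, written out so the scripted elementwise ops can be fused
    return y1 * torch.sigmoid(y1) * y2
```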
Add jit_fuser to switch between torch.jit.script and torch.compile

See merge request ADLR/megatron-lm!1036
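
A minimal sketch of what such a switch can look like (illustrative only; the exact Megatron-LM helper may be defined differently): pick `torch.compile` when the installed PyTorch provides it, otherwise fall back to `torch.jit.script`, and decorate fusible elementwise functions with it.

```python
import torch

# use torch.compile on PyTorch >= 2.0, torch.jit.script otherwise
jit_fuser = torch.compile if hasattr(torch, "compile") else torch.jit.script

@jit_fuser
def bias_gelu(bias: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # tanh-approximation GELU applied to (input + bias)
    x = bias + y
    return 0.5 * x * (1.0 + torch.tanh(0.79788456 * x * (1.0 + 0.044715 * x * x)))
```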
Run black on megatron/optimizer

See merge request ADLR/megatron-lm!1050
deepakn94 and others added 30 commits February 25, 2024 23:00
Unify resume and correctness functional tests

See merge request ADLR/megatron-lm!1070
Mcore mock multimodal dataset

See merge request ADLR/megatron-lm!1147
…ommunication

Compute norm once per batch (instead of once per microbatch) and once per bucket (instead of once per param)
Fix NaN checking in grads: should be performed before data-parallel all-reduce

See merge request ADLR/megatron-lm!989
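
The point of the NaN-check ordering is that once gradients have been summed across data-parallel ranks, a NaN from any rank contaminates all of them; checking the local gradients first localizes the fault. A hypothetical helper (not the Megatron-LM code; assumes torch.distributed is already initialized) might look like:

```python
import torch
import torch.distributed as dist

def allreduce_grads_with_nan_check(params, group=None):
    grads = [p.grad for p in params if p.grad is not None]
    # check local grads *before* the all-reduce so the offending rank is identifiable
    for g in grads:
        if not torch.isfinite(g).all():
            raise FloatingPointError(
                f"NaN/Inf in local gradient on rank {dist.get_rank(group)}"
            )
    for g in grads:
        dist.all_reduce(g, op=dist.ReduceOp.SUM, group=group)
```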
Move to Draco OCI

See merge request ADLR/megatron-lm!1137
Print number of transformer and embedding parameters separately

See merge request ADLR/megatron-lm!1159
Mcore LLaVA model

See merge request ADLR/megatron-lm!1151
[OMNIML-614] AMMO ptq + TensorRT-LLM export examples for megatron-lm

See merge request ADLR/megatron-lm!1013
Make throughput and memory footprint formulae compatible with arbitrary ffn_hidden_size

See merge request ADLR/megatron-lm!1169
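
As background, older throughput estimates often hard-code the MLP width as 4 * hidden_size; making ffn_hidden_size an explicit parameter keeps the estimate valid for SwiGLU or otherwise non-standard MLP widths. A rough forward-pass approximation (generic, not the exact Megatron-LM formula):

```python
def flops_per_token_per_layer(hidden_size: int, ffn_hidden_size: int, seq_len: int) -> int:
    # QKV + output projections: 4 * h^2 multiply-adds per token
    attention_proj = 4 * hidden_size * hidden_size
    # attention scores and weighted sum over seq_len keys/values
    attention_scores = 2 * seq_len * hidden_size
    # two MLP GEMMs: h -> f and f -> h
    mlp = 2 * hidden_size * ffn_hidden_size
    # 2 FLOPs per multiply-add, forward pass only
    return 2 * (attention_proj + attention_scores + mlp)

# e.g. a GPT-3-like layer with the conventional f = 4h
print(flops_per_token_per_layer(12288, 4 * 12288, 2048))
```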
Experimental Yaml configs

See merge request ADLR/megatron-lm!1134
This reverts commit fe1f23c.